13.2 Botryology
161
Fig. 13.1 Each object is represented by a cross corresponding to its value of the chosen feature on
the real line double struck upper RR. Clusters C1, C2, and C3 are easily identifiable. The spot on the line represents a
possible value that could be used to divide the set dichotomously
waking hours (in other words, the comparison of unknown objects with known pro-
totypes), of more current interest in bioinformatics is the unsupervised discernment
of patterns in, for example, gene and genome sequences, especially since the pro-
portion of unknown material in genomes is still overwhelming. 5 A very powerful
methodology for achieving that is to examine whether the data resulting from some
operations carried out on a DNA sequence (for example) can be arranged in such a
way that structure appears, namely that groups of data points constituting a subset
of the entire dataset are clumped together to form two or more distinct entities.
The clustering process is defined as the partition of a set of objects by some features
into disjoint subsets, and each subset in which objects are united by some features
is called the cluster. If no relation between the objects is known, it is impossible to
construct clusters.
The simplest case of clustering arises when only one feature exists; each object
under consideration either has the feature or does not, in which case the maximum
number of clusters is two and, if it happens that all the objects have that feature, then
there will be only one cluster. Another simple case arises if values from the real-
number line can be attributed to the feature (Fig. 13.1). This is easily generalized to
two or more dimensions, the number of dimensions being equal to the number of
chosen features.
If the set of objects is large and many features have been chosen, it is necessary
to have algorithms for clustering that enable it to be carried out automatically on
the computer. Many such algorithms are known; a few of them are briefly described
below. 6 It is assumed that there is a set StartSet upper X EndSet{X} of objects (upper X Subscript iXi, etc.) in upper NN-dimensional
feature space. For ease of representation, we will tacitly consider upper N equals 2N = 2.
Hyperspheres. A circle of radius rr is drawn around an arbitrarily chosen object.
Objects within the circle form the first subcluster. New circles are now drawn with
their centres at these other objects, which encompass yet more objects, around which
new circles are again drawn, and so forth until no new objects are added. If all of
the objects in the set are now included, the process has failed. If, on the other hand,
objects remain, then one of those remaining objects is arbitrarily chosen and the
process is repeated.
5 We also have the intermediate process of semisupervised learning, which deals with the problem
of combining small amounts of labelled data with large amounts of unlabelled data—the classic
paper is Zhu et al. (2003).
6 See Verulava et al. (2009) for the “rank of links” method.